In [1]:

    
from msdas import *
%pylab inline
reload(annotations)









    



Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib






    Out[1]:





<module 'msdas.annotations' from '/home/cokelaer/Work/github/msdas/src/msdas/annotations.pyc'>

Introduction

When reading an input file, the Entry and Entry_name columns may not be set at all. In addition, the full sequence and the GO terms are not necessarily provided. The annotations module retrieves the UniProt entry names and these annotations.



In [2]:

    
filename = yeast.get_yeast_filenames()[0]
r = readers.MassSpecReader(filename)









    



INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/alpha0.csv
WARNING:root:Some Phospho strings found in Sequence column. No Sequence_Phospho column found.Renaming Sequence into Sequence_Phospho
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

At this point, the dataframe held by the MassSpecReader contains the data and some metadata, but no UniProt information such as the UniProt entry. GO terms and UniProt IntAct information can also be retrieved from UniProt. The annotations module provides tools to fetch this kind of information automatically.

The input can be a filename or an existing MassSpecReader instance.



In [3]:

    
a = annotations.Annotations(r, "YEAST", verbose=True)









    



INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Entry column not found in the dataframe. call get_uniprot_entries
INFO:root:Initialising UniProt service (REST)



In [4]:

    
a.annotations  # empty for now



In [5]:

    
a._mapping  # empty for now









    Out[5]:





{}



In [6]:

    
a.get_uniprot_entries()   # requires a network connection; may take a few seconds









    



INFO:root:Fetching uniprot accession numbers for 57 entries
INFO:root:Fetching uniprot accession numbers for 23 unique entries
WARNING:root:deprecated in version 1.3.1. Use mapping instead
INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.3; Linux) Python-requests/2.7.0
INFO:root:getUserAgent: End
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): www.uniprot.org



In [7]:

    
a._mapping









    Out[7]:





{u'DIG1_YEAST': [u'Q03063'],
 u'DIG2_YEAST': [u'Q03373'],
 u'FAR1_YEAST': [u'P21268'],
 u'FPS1_YEAST': [u'P23900'],
 u'FUS3_YEAST': [u'P16892'],
 u'GPA1_YEAST': [u'P08539'],
 u'GPD1_YEAST': [u'Q00055'],
 u'HOG1_YEAST': [u'P32485'],
 u'HOT1_YEAST': [u'Q03213'],
 u'PBS2_YEAST': [u'P08018'],
 u'PTP2_YEAST': [u'P29461'],
 u'RCK2_YEAST': [u'P38623'],
 u'SIC1_YEAST': [u'P38634'],
 u'SKO1_YEAST': [u'Q02100'],
 u'SLN1_YEAST': [u'P39928'],
 u'SSK1_YEAST': [u'Q07084'],
 u'SSK2_YEAST': [u'P53599'],
 u'STE11_YEAST': [u'P23561'],
 u'STE12_YEAST': [u'P13574'],
 u'STE20_YEAST': [u'Q03497'],
 u'STE2_YEAST': [u'D6VTK4'],
 u'STE50_YEAST': [u'P25344'],
 u'TEC1_YEAST': [u'P18412']}
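
The dictionary above maps each UniProt entry name to a list of accession numbers, and the Entry column of the dataframe is filled from it. The lookup can be sketched in plain Python as follows. This is only an illustration, not the msdas implementation: it assumes that the entry name is the protein name followed by the organism suffix and that the first accession of each list is kept. `build_entry_column` is a hypothetical helper and `XYZ1` is a made-up protein name used to show the unmapped case.

```python
def build_entry_column(proteins, mapping, organism="YEAST"):
    """Return one accession per protein name, or None when unmapped."""
    entries = []
    for protein in proteins:
        # Assumption: entry name = protein name + "_" + organism
        entry_name = "{0}_{1}".format(protein, organism)
        accessions = mapping.get(entry_name, [])
        # Assumption: keep the first accession when several are returned
        entries.append(accessions[0] if accessions else None)
    return entries


# Subset of the mapping shown above
mapping = {
    "DIG1_YEAST": ["Q03063"],
    "DIG2_YEAST": ["Q03373"],
    "FAR1_YEAST": ["P21268"],
}

# One protein name per row; XYZ1 is hypothetical and has no mapping
proteins = ["DIG1", "DIG1", "DIG2", "FAR1", "XYZ1"]

# Query each name only once, as the log does (57 rows, 23 unique entries)
unique_names = sorted(set(proteins))
entries = build_entry_column(proteins, mapping)
```

Deduplicating the names before querying keeps the number of remote lookups equal to the number of distinct proteins rather than the number of rows.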



In [8]:

    
a.df[['Protein', 'Psite', 'Entry']].loc[0:10]









    Out[8]:






  
    
      
    Protein  Psite           Entry
0   DIG1     S126+S127       Q03063
1   DIG1     S142            Q03063
2   DIG1     S272            Q03063
3   DIG1     S272^S275       Q03063
4   DIG1     S272^T277^S279  Q03063
5   DIG1     S330            Q03063
6   DIG1     S395            Q03063
7   DIG2     S225            Q03373
8   DIG2     S84             Q03373
9   DIG2     T83             Q03373
10  FAR1     S114            P21268
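
The Psite column packs several phosphorylation sites into a single string, using `^` and `+` as separators. A minimal sketch of how such strings could be split is given here. It assumes that `^` joins sites found together on the same peptide and that `+` separates alternative (ambiguous) assignments; this reading is an assumption and should be checked against the msdas documentation. `parse_psite` is a hypothetical helper, not part of msdas.

```python
import re

# One site: a residue letter (S, T or Y) followed by its position
PSITE_PATTERN = re.compile(r"^([STY])(\d+)$")


def parse_psite(psite):
    """Split a Psite string into alternatives, each a list of (residue, position)."""
    alternatives = []
    for alternative in psite.split("+"):      # assumed: "+" separates alternatives
        sites = []
        for token in alternative.split("^"):  # assumed: "^" joins co-occurring sites
            match = PSITE_PATTERN.match(token)
            if match is None:
                raise ValueError("unrecognised site: %r" % token)
            sites.append((match.group(1), int(match.group(2))))
        alternatives.append(sites)
    return alternatives


ambiguous = parse_psite("S126+S127")        # [[('S', 126)], [('S', 127)]]
combined = parse_psite("S272^T277^S279")    # [[('S', 272), ('T', 277), ('S', 279)]]
```

Returning a list of alternatives keeps both separators in one structure: a single-site string such as `T83` yields one alternative containing one site.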



In [9]:

    
a.set_annotations()









    



INFO:root:Fectching information from uniprot. Takes some time
INFO:root:fetching information from uniprot for 23 entries
INFO:root:uniprot.get_df 1/1
WARNING:root:column could not be parsed. Protein families
WARNING:root:column could not be parsed. interactor
WARNING:root:column could not be parsed. Subcellular location
INFO:root:Fectching 23
INFO:root:Annotations have been loaded. You can save the annotations dataframe attribute using x.to_pickle('annotations.pkl')  Next time, you could just load if using 

     >>> m = readers.MassSpecReader(filename, mode='yeast')
     >>>  m.read_annotations('annotations.pkl')
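
The log message above suggests saving the annotations to a pickle file so that the slow UniProt request is not repeated. The general load-or-fetch pattern can be sketched with the standard library alone. This sketch does not use the msdas API: `load_or_fetch` and `fake_fetch` are hypothetical names, and the dictionary stands in for the annotations dataframe.

```python
import os
import pickle
import tempfile


def load_or_fetch(path, fetch):
    """Return cached data from `path`, or call `fetch()` and cache its result."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    data = fetch()
    with open(path, "wb") as handle:
        pickle.dump(data, handle)
    return data


calls = []


def fake_fetch():
    # Stand-in for the slow network request; records each call
    calls.append(1)
    return {"DIG1_YEAST": ["Q03063"]}


cache_path = os.path.join(tempfile.mkdtemp(), "annotations.pkl")
first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the cache
second = load_or_fetch(cache_path, fake_fetch)  # reads the cache, no fetch
```

Only the first call reaches `fake_fetch`; the second one is served from the file, which is the behaviour the log message recommends for later sessions.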



In [10]:

    
a.df[['Protein', 'Psite', 'Entry']].loc[0:10]









    Out[10]:






  
    
      
    Protein  Psite           Entry
0   DIG1     S126+S127       Q03063
1   DIG1     S142            Q03063
2   DIG1     S272            Q03063
3   DIG1     S272^S275       Q03063
4   DIG1     S272^T277^S279  Q03063
5   DIG1     S330            Q03063
6   DIG1     S395            Q03063
7   DIG2     S225            Q03373
8   DIG2     S84             Q03373
9   DIG2     T83             Q03373
10  FAR1     S114            P21268